A modular data analysis pipeline for the discovery of novel RNA motifs
Authors
Abstract
This dissertation presents a modular software pipeline that searches collections of RNA sequences for novel RNA motifs. Here the motifs incorporate elements of both primary and secondary structure. The motif search pipeline breaks sets of RNA sequences into shortened segments of primary sequence, which are then folded to obtain low-energy secondary structures. The distance estimation module then calculates distances between the folded segments (bricks) and analyzes the resulting distance matrices for patterns. An initial implementation of the pipeline is applied to synthetic and biological data sets. This implementation introduces a new distance measure for comparing RNA sequences, based on structural annotation of the folded sequence, as well as a new data analysis technique called non-linear projection. The modular nature of the pipeline is then used to explore the relationships between several different distance measures on random data, synthetic data, and a biological data set consisting of iron response elements. The different distance measures are shown to capture different relationships between the RNA sequences. The non-linear projection algorithm is used to produce 2-dimensional projections of the distance matrices, which are examined by inspection and by k-means multiclustering. The pipeline successfully clusters the synthetic RNA sequences, using only primary sequence data, as well as the iron response elements data set. The dissertation also presents a preliminary analysis of a large biological data set of HIV sequences.
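The segmentation and distance-matrix stages described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names (`segment`, `levenshtein`, `distance_matrix`), the window width, and the step size are all hypothetical, the folding stage is omitted because it would require an external folding tool, and plain edit distance on primary sequence stands in for the structural-annotation distance measure, which the abstract does not specify.

```python
from itertools import combinations


def segment(seq, width, step):
    """Cut `seq` into overlapping windows of `width` bases, advancing by `step`.

    Hypothetical helper: the abstract does not state how segments are chosen.
    """
    if len(seq) < width:
        return [seq]
    return [seq[i:i + width] for i in range(0, len(seq) - width + 1, step)]


def levenshtein(a, b):
    """Edit distance between two strings (stand-in distance measure only)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def distance_matrix(items, dist):
    """Symmetric pairwise distance matrix for any distance function `dist`."""
    n = len(items)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = dist(items[i], items[j])
    return matrix


# Toy input sequence, made up for illustration.
segments = segment("GGGAAACCCUUU", width=6, step=3)
matrix = distance_matrix(segments, levenshtein)
```

Because `distance_matrix` takes the distance function as an argument, a different measure can be swapped in without touching the other stages, which loosely mirrors the modular comparison of distance measures that the abstract describes.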
Similar resources
P-215: Discovery of A Novel APA Variant of A Human Potential Gene Based on Expressed Sequenced Tags Analysis
Background: Expressed sequence tags (ESTs) are sequences of cDNA fragments prepared from different tissue sources. There are over one million of these sequences in the publicly available database, and these sequences are believed to represent more than half of all human genes. The ESTs belong to different cDNA libraries, each prepared from one particular cell type, organ, or tumor. Therefore, th...
A Novel Hybrid-Excited Modular Variable Reluctance Motor for Electric Vehicle Applications: Analysis, Comparison, and Implementation
A variable reluctance machine (VRM) has been proven to be an outstanding candidate for electric vehicle (EV) applications. This paper introduces a new double-stator, 12/14/12-pole three-phase hybrid-excited modular variable reluctance machine (MVRM) for EV applications. In order to demonstrate the superiorities of the proposed structure, the static torque characteristics and dynamic performance...
Discovering cis-Regulatory RNAs in Shewanella Genomes by Support Vector Machines
An increasing number of cis-regulatory RNA elements have been found to regulate gene expression post-transcriptionally in various biological processes in bacterial systems. Effective computational tools for large-scale identification of novel regulatory RNAs are strongly desired to facilitate our exploration of gene regulation mechanisms and regulatory networks. We present a new computational p...
Pipeline upheaval buckling in clayey backfill using numerical analysis
Offshore pipelines used for oil and gas transportation are often buried to avoid damage from fishing activities and to provide thermal insulation. Thermal expansion and contraction of the pipeline during operation can lead to lateral or upheaval buckling. A safe buried pipeline design must take into account a reliable evaluation of soil uplift resistance and pipe embedment depth. While the cost...
A Computational Pipeline for High-Throughput Discovery of cis-Regulatory Noncoding RNA in Prokaryotes
Noncoding RNAs (ncRNAs) are important functional RNAs that do not code for proteins. We present a highly efficient computational pipeline for discovering cis-regulatory ncRNA motifs de novo. The pipeline differs from previous methods in that it is structure-oriented, does not require a multiple-sequence alignment as input, and is capable of detecting RNA motifs with low sequence conservation. W...
Publication date: 2015